Add Kata containers docs for 26.3.0 release #365

Open

a-mccarthy wants to merge 1 commit into NVIDIA:main from a-mccarthy:coco-26.3.0

Conversation

@a-mccarthy (Collaborator):

No description provided.

@github-actions:

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-365

* Transparent deployment of unmodified containers.

****************************
Limitations and Restrictions
Collaborator Author:

@manuelh-dev are these limitations still correct?

Contributor:

Left comments on several of these.

#. Specify at least the following options when you install the Operator.
If you want to run Kata Containers by default on all worker nodes, also specify ``--set sandboxWorkloads.defaultWorkload=vm-passthrough``.

.. code-block:: console
@a-mccarthy (Collaborator Author), Mar 17, 2026:

The upstream doc calls out enabling NFD in the install command (and also disabling it in the kata-deploy install). Is that needed? Can you elaborate on why users should include those options?
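For readers of this thread, the two complementary settings being debated look roughly like this (a sketch only: the kata-deploy flag spelling appears verbatim in the helm command quoted later in this review, while the GPU Operator side is an assumption):

```shell
# Sketch of the two complementary NFD settings under discussion.
# The kata-deploy flag is taken from the helm command quoted later in this
# review; the GPU Operator flag spelling is an assumption.
GPU_OPERATOR_NFD_FLAG="--set nfd.enabled=true"    # let the GPU Operator manage NFD
KATA_DEPLOY_NFD_FLAG="--set nfd.enabled=false"    # don't run a second NFD from kata-deploy
echo "gpu-operator install: ${GPU_OPERATOR_NFD_FLAG}"
echo "kata-deploy install:  ${KATA_DEPLOY_NFD_FLAG}"
```

Presumably the intent is to avoid two NFD instances labeling the same nodes, but that is exactly the rationale being asked for here.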

Contributor:

@jojimt - can you help here? see https://github.com/kata-containers/kata-containers/pull/12651/changes on what we currently suggest in the Kata docs



*********************
Run a Sample Workload
Collaborator Author:

Are there any updates needed for the sample app?

Contributor:

Left relevant comments further in line

* The ``kata-qemu-nvidia-gpu`` runtime class is used with Kata Containers.

* The ``kata-qemu-nvidia-gpu-snp`` runtime class is used with Confidential Containers and is installed by default even though it is not used with this configuration.

Collaborator Author:

@manuelh-dev When you install kata-deploy, more NVIDIA runtimes are listed than just these two, and your doc calls out kata-qemu-nvidia-gpu-tdx. Should we include anything about this runtime? What does it do?

Contributor:

I think, for the Kata doc here, let's just emphasize that it deploys the kata-qemu-nvidia-gpu runtime class and not talk much about the other runtime classes?

We should add that if the nvidia-cc-manager pod is running after you deploy, you may need to change the mode to sandbox workloads only (in this case you have deployed the solution on CC-capable hardware). We can then reference the sibling CoCo doc, which already explains how to mode-switch!?
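For context, the mode switch mentioned here is a ClusterPolicy-level setting in the GPU Operator; a hypothetical fragment, with field names assumed from the GPU Operator API rather than confirmed in this thread:

```yaml
# Hypothetical ClusterPolicy fragment: run sandboxed (Kata) workloads by default
spec:
  sandboxWorkloads:
    enabled: true
    defaultWorkload: vm-passthrough
```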

NVIDIA's approach to the Confidential Containers architecture delivers on the key promise of Confidential Computing: confidentiality, integrity, and verifiability.
Integrating open source and NVIDIA software components with the Confidential Computing capabilities of NVIDIA GPUs, the Reference Architecture for Confidential Containers is designed to be the secure and trusted deployment model for AI workloads.

.. image:: graphics/CoCo-Reference-Architecture.png
Contributor:

Minor concern: Do we ever really reference the two illustrations? At least in "Software Components for Confidential Containers" we could refer back to the illustration.

The illustrations are somewhat 'dangling' and not well-explained. Maybe this can be fixed by moving them to where we actually reference these (maybe I was not reading well enough)?


http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
Contributor:

May be good to rebase against 3709f11 and solve potential conflicts


Following is the platform and feature support scope for the Early Access (EA) release of the Confidential Containers open Reference Architecture published by NVIDIA.

.. flat-table:: Supported Platforms
Contributor:

Is this not duplication of above?

| - NVIDIA Confidential Computing Manager for Kubernetes
| - NVIDIA Kata Manager for Kubernetes
- v25.10.0 and higher
* - CoCo release (EA)
Contributor:

Is this intentional? I think for our latest stack we need a Kata 3.28 release.
I don't know what 'v0.18.0' is here, and I am not sure we have the exact trustee/guest components in these versions. We are not using a concrete CoCo release. We are using a Kata release, and this Kata release pulls in CoCo components as dependencies.

Contributor:

No longer the case with latest bits

@@ -242,4 +239,4 @@
Contributor:

See https://github.com/manuelh-dev/kata-containers/blob/mahuber/doc-update-nvidia-gpu-op/docs/use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md:
The currently supported modes for enabling GPU workloads in the TEE scenario are: (1) single‑GPU passthrough (one physical GPU per pod) and (2) multi‑GPU passthrough on NVSwitch (NVLink) based HGX systems (for example, HGX Hopper (SXM) and HGX Blackwell / HGX B200).

@@ -242,4 +239,4 @@
* Support is limited to initial installation and configuration only. Upgrade and configuration of existing clusters to configure confidential computing is not supported.
Contributor:

No longer the case. We have a new component, the Kata lifecycle manager (https://github.com/kata-containers/lifecycle-manager); this component is currently at tag v0.1.2 but will still need to be incremented.

@@ -242,4 +239,4 @@
* Support is limited to initial installation and configuration only. Upgrade and configuration of existing clusters to configure confidential computing is not supported.
* Support for confidential computing environments is limited to the implementation described on this page.
Contributor:

I don't really know what this means. Does this mean we only support Intel and AMD?

* Support is limited to initial installation and configuration only. Upgrade and configuration of existing clusters to configure confidential computing is not supported.
* Support for confidential computing environments is limited to the implementation described on this page.
* NVIDIA supports the GPU Operator and confidential computing with the containerd runtime only.
* NFD doesn't label all Confidential Container capable nodes as such automatically. In some cases, users must manually label nodes to deploy the NVIDIA Confidential Computing Manager for Kubernetes operand onto these nodes as described in the deployment guide.
Contributor:

This is no longer the case

* Run ``sudo update-grub`` after making the change to configure the bootloader. Reboot the host after configuring the bootloader.

* You have a Kubernetes cluster and you have cluster administrator privileges.
* For this cluster, you are using containerd 2.1 and Kubernetes v1.34. These versions have been validated with the kata-containers project and are recommended. Use a ``runtimeRequestTimeout`` of more than 5 minutes in your `kubelet configuration <https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/>`_ (the current method of pulling container images within the confidential container may exceed the two-minute default timeout when using large container images).
Contributor:

This is not rendered properly. Need to add an empty line

* Run ``sudo update-grub`` after making the change to configure the bootloader. Reboot the host after configuring the bootloader.

* You have a Kubernetes cluster and you have cluster administrator privileges.
* For this cluster, you are using containerd 2.1 and Kubernetes v1.34. These versions have been validated with the kata-containers project and are recommended. Use a ``runtimeRequestTimeout`` of more than 5 minutes in your `kubelet configuration <https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/>`_ (the current method of pulling container images within the confidential container may exceed the two-minute default timeout when using large container images).
Contributor:

Let's check on containerd and Kubernetes version based on what we author in overview.rst. There may be discrepancies


This step ensures that you can continue to run traditional container workloads with GPU or vGPU workloads on some nodes in your cluster. Alternatively, you can set a default sandbox workload parameter to vm-passthrough to run confidential containers on all worker nodes when you install the GPU Operator.

2. Install the latest Kata Containers helm chart (minimum version: 3.24.0).
Contributor:

3.28.0


This step installs all required components from the Kata Containers project including the Kata Containers runtime binary, runtime configuration, UVM kernel and initrd that NVIDIA uses for confidential containers and native Kata containers.

3. Install the latest version of the NVIDIA GPU Operator (minimum version: v25.10.0).
Contributor:

26.3


$ kubectl label node <node-name> nvidia.com/gpu.workload.config=vm-passthrough

2. Use the 3.24.0 Kata Containers version and chart in environment variables::
Contributor:

3.28
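A sketch of that environment-variable step with the version bumped as the reviewer suggests (3.28.0 and the OCI chart path are taken from this review thread; treat both as assumptions until confirmed):

```shell
# Hypothetical sketch: pin the kata-deploy release and chart reference.
# 3.28.0 is the version suggested in review; the OCI path matches the
# "Pulled:" output shown elsewhere on this page.
VERSION="3.28.0"
CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
echo "kata-deploy chart: ${CHART}, version: ${VERSION}"
```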

--create-namespace \
-f "https://raw.githubusercontent.com/kata-containers/kata-containers/refs/tags/${VERSION}/tools/packaging/kata-deploy/helm-chart/kata-deploy/try-kata-nvidia-gpu.values.yaml" \
--set nfd.enabled=false \
--set shims.qemu-nvidia-gpu-tdx.enabled=false \
Contributor:

See https://github.com/kata-containers/kata-containers/pull/12651/changes, this is updated. Let's use the command from there


*Example Output*::

Pulled: ghcr.io/kata-containers/kata-deploy-charts/kata-deploy:3.24.0
Contributor:

Need to change that as well.


5. Verify that the kata-qemu-nvidia-gpu and kata-qemu-nvidia-gpu-snp runtime classes are available::

$ kubectl get runtimeclass
Contributor:

-tdx should now also be present

REVISION: 1
TEST SUITE: None

Note that, for heterogeneous clusters with different GPU types, you can omit
Contributor:

nvidia-sandbox-validator-6xnzc 1/1 Running 1 30s
nvidia-vfio-manager-h229x 1/1 Running 0 62s

4. If the nvidia-cc-manager is *not* running, you need to label your CC-capable node(s) by hand. The node labelling capabilities in the early access version are not complete. To label your node(s), run::
Contributor:

This can be removed. Node capability detection works now

Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

b. Confirm that the kata-deploy functionality installed the kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu runtime class files::
@manuelh-dev (Contributor), Mar 18, 2026:

-tdx as well. Maybe we should say "installed the relevant runtime classes".

Note the double "::" -- not quite consistent with all other bullet points.


$ ls -l /opt/kata/share/defaults/kata-containers/ | grep nvidia

*Example Output*::
Contributor:

Suggest to refresh. This looks odd in terms of file sizes and dates


b. Confirm that the kata-deploy functionality installed the kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu runtime class files::

$ ls -l /opt/kata/share/defaults/kata-containers/ | grep nvidia
Contributor:

These directories have changed. Need to refresh


c. Confirm that the kata-deploy functionality installed the UVM components::

$ ls -l /opt/kata/share/kata-containers/ | grep nvidia
Contributor:

Let's refresh this as well


1. Create a file, such as the following cuda-vectoradd-kata.yaml sample, specifying the kata-qemu-nvidia-gpu-snp runtime class:

.. code-block:: yaml
Contributor:

Please take a look at: https://github.com/kata-containers/kata-containers/pull/12651/changes - the flow has slightly changed with 'echo'
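For reference, a minimal manifest of the kind this step describes might look like the following (a sketch; the image name and tag are assumptions, and only the ``runtimeClassName`` is the point here):

```yaml
# cuda-vectoradd-kata.yaml (sketch): request one GPU under the SNP runtime class
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
spec:
  runtimeClassName: kata-qemu-nvidia-gpu-snp
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
    resources:
      limits:
        "nvidia.com/gpu": 1
```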

@@ -94,7 +94,6 @@
* NVIDIA Confidential Computing Manager (cc-manager) for Kubernetes - to set the confidential computing (CC) mode on the NVIDIA GPUs.
* NVIDIA Sandbox Device Plugin - to discover NVIDIA GPUs along with their capabilities, to advertise these to Kubernetes, and to allocate GPUs during pod deployment.
Contributor:

https://github.com/kata-containers/kata-containers/pull/12651/changes has an updated version on the description of the sandbox device plugin

The page describes deploying Confidential Containers with the NVIDIA GPU Operator.
The implementation relies on the Kata Containers project to provide the lightweight utility Virtual Machines (UVMs) that feel and perform like containers but provide strong workload isolation.

Refer to the `Confidential Containers overview <https://docs.nvidia.com/datacenter/cloud-native/confidential-containers/latest/overview.html>`_ for details on the reference architecture and supported platforms.
Contributor:

General remark: let's make sure we mention SNP and TDX in parity. I think I caught all prior occurrences where we only talked about SNP, but good to double check in general

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
Contributor:

General comment: https://github.com/kata-containers/kata-containers/pull/12651/changes has a paragraph/section called "feature set" - to discuss whether we want that here as well, or in the other deployment instructions?


* Run ``sudo update-grub`` after making the change to configure the bootloader. Reboot the host after configuring the bootloader.

* You have a Kubernetes cluster and you have cluster administrator privileges.
Contributor:

Should we note that the NVIDIA shim comes with a 20min timeout and that clusters used for NVIDIA testing in CI also use a 20 min timeout to pull very large container images?
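If that note is added, the kubelet drop-in raising the timeout might look like the following (a sketch; the 20-minute value mirrors what this comment describes for NVIDIA CI, and ``runtimeRequestTimeout`` is the standard KubeletConfiguration field):

```yaml
# KubeletConfiguration fragment (sketch): raise the runtime request timeout so
# very large confidential-container images can be pulled; the default is 2m.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "20m"
```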

Hardware virtualization and a separate kernel provide improved workload isolation
in comparison with traditional containers.

The NVIDIA GPU Operator works with the Kata container runtime.
Contributor:

Should discuss internally how we want to position this. At least for the CoCo docs we wanted to make the GPU operator a bit less prevalent

Limitations and Restrictions
****************************

* GPUs are available to containers as a single GPU in passthrough mode only.
Contributor:

This will need change. To follow up offline about this. For CoCo multi-gpu passthrough is partially supported. Let's talk

* GPUs are available to containers as a single GPU in passthrough mode only.
Multi-GPU passthrough and vGPU are not supported.

* Support is limited to initial installation and configuration only.
Contributor:

See my comment in the CoCo doc about the kata lifecycle manager component :)

* Support is limited to initial installation and configuration only.
Upgrade and configuration of existing clusters for Kata Containers is not supported.

* Support for Kata Containers is limited to the implementation described on this page.
Contributor:

Question on the support for Red Hat OpenShift - do we need/have that on the CoCo deploy page as well?

* ``Node Feature Discovery`` -- to detect CPU, kernel, and host features and label worker nodes.
* ``NVIDIA GPU Feature Discovery`` -- to detect NVIDIA GPUs and label worker nodes.
- * ``NVIDIA Sandbox Device Plugin`` -- to discover and advertise the passthrough GPUs to kubelet.
* ``NVIDIA Confidential Computing Manager for Kubernetes`` -- to set the confidential computing (CC) mode on the NVIDIA GPUs.
Contributor:

A big IF here: Appears only if you deploy on CC hardware. We should clarify here as well.

Collaborator Author:

Is this determined based on a label, or some other way that is visible/checkable to users? Or should we reference the CC supported platforms section so users can check whether their nodes will get the CC manager?

Contributor:

We should maybe not even list this here, as this is kata-containers-deploy.rst. We should note, however, that if this pod appears, it means you have deployed Kata on CC-capable hardware, which implies that cc-manager will start and by default transition the GPUs into confidential mode.

We can potentially/likely control this behavior by changing the GPU operator deployment instructions, so that cc-manager is disabled or by default transitions the GPUs into non-confidential mode. @jojimt can you help here possibly?

===================

Install the kata-deploy Helm chart.
Minimum required version is 3.24.0.
Contributor:

3.28

TEST SUITE: None
For heterogeneous clusters with different GPU types, you can specify an empty `P_GPU_ALIAS` environment variable for the sandbox device plugin, ``--set 'sandboxDevicePlugin.env[0].name=P_GPU_ALIAS'`` and ``--set 'sandboxDevicePlugin.env[0].value=""'``.
Contributor:

We will want a version like in https://github.com/kata-containers/kata-containers/pull/12651/changes -- to confirm whether we will want exactly the same note for CoCo and Kata. To discuss.


Suggested change
| Intel EMR / GNR


Suggested change
- Ubuntu 25.10

* ``NVIDIA MIG Manager for Kubernetes`` -- to manage MIG-capable GPUs.
* ``Node Feature Discovery`` -- to detect CPU, kernel, and host features and label worker nodes.
* ``NVIDIA GPU Feature Discovery`` -- to detect NVIDIA GPUs and label worker nodes.
- * ``NVIDIA Sandbox Device Plugin`` -- to discover and advertise the passthrough GPUs to kubelet.
Collaborator Author:

@manuelh-dev How are this and the nvidia-kata-sandbox-device-plugin related? Do they both get deployed, or only one when Kata mode is enabled?

Contributor:

To my understanding, the ``NVIDIA Sandbox Device Plugin`` ("to discover and advertise the passthrough GPUs to kubelet") is the nvidia-kata-sandbox-device-plugin -- @jojimt @rajatchopra please correct me if I'm wrong.

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>